rating distribution
Is analogy enough to draw novel adjective-noun inferences?
Ross, Hayley, Davidson, Kathryn, Kim, Najoung
Recent work (Ross et al., 2025, 2024) has argued that the ability of humans and LLMs respectively to generalize to novel adjective-noun combinations shows that they each have access to a compositional mechanism to determine the phrase's meaning and derive inferences. We study whether these inferences can instead be derived by analogy to known inferences, without need for composition. We investigate this by (1) building a model of analogical reasoning using similarity over lexical items, and (2) asking human participants to reason by analogy. While we find that this strategy works well for a large proportion of the dataset of Ross et al. (2025), there are novel combinations for which both humans and LLMs derive convergent inferences but which are not well handled by analogy. We thus conclude that the mechanism humans and LLMs use to generalize in these cases cannot be fully reduced to analogy, and likely involves composition.
Validating LLM-as-a-Judge Systems in the Absence of Gold Labels
Guerdan, Luke, Barocas, Solon, Holstein, Kenneth, Wallach, Hanna, Wu, Zhiwei Steven, Chouldechova, Alexandra
The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus, and then aggregate the ratings into a single, per-item gold label rating. High agreement rates between these gold labels and judge system ratings are then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or there may be principled disagreement among human raters. In such settings, gold labels may not exist for many of the items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We also demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches that we describe. Based on our findings, we provide concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.
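The contrast drawn above, between agreement with an aggregated gold label and agreement with the full spread of human ratings, can be illustrated with a minimal sketch. It is not the framework proposed in the paper: the function names, the tie-breaking rule, and the toy ratings are all assumptions made for illustration.

```python
from collections import Counter

def majority_label(ratings):
    """Most common rating for one item; ties go to the smallest rating value (an assumed rule)."""
    counts = Counter(ratings)
    top = max(counts.values())
    return min(r for r, c in counts.items() if c == top)

def hard_agreement(human_ratings, judge_ratings):
    """Fraction of items where the judge matches the majority-vote label."""
    hits = sum(majority_label(h) == j for h, j in zip(human_ratings, judge_ratings))
    return hits / len(judge_ratings)

def soft_agreement(human_ratings, judge_ratings):
    """Mean fraction of human raters per item who gave the judge's rating."""
    fracs = [h.count(j) / len(h) for h, j in zip(human_ratings, judge_ratings)]
    return sum(fracs) / len(fracs)

# Hypothetical data: three items, three human ratings each, one judge rating each.
humans = [[1, 1, 2], [2, 2, 3], [1, 2, 3]]
judge = [1, 2, 3]
print(hard_agreement(humans, judge), soft_agreement(humans, judge))
```

On the third item the human raters disagree completely, so the majority label depends only on the tie-breaking rule, while the soft measure still credits the judge for matching one of the three raters.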
Meta-Learning with Adaptive Weighted Loss for Imbalanced Cold-Start Recommendation
Kim, Minchang, Yang, Yongjin, Ryu, Jung Hyun, Kim, Taesup
Sequential recommenders have made great strides in capturing a user's preferences. Nevertheless, cold-start recommendation remains a fundamental challenge, as it typically involves limited user-item interactions for personalization. Recently, gradient-based meta-learning approaches have emerged in the sequential recommendation field due to their fast adaptation and easy-to-integrate abilities. The meta-learning algorithms formulate the cold-start recommendation as a few-shot learning problem, where each user is represented as a task to be adapted. While meta-learning algorithms generally assume that task-wise samples are evenly distributed over classes or values, user-item interactions in real-world applications do not conform to such a distribution (e.g., watching favorite videos multiple times, leaving only positive ratings without any negative ones). Consequently, imbalanced user feedback, which accounts for the majority of task training data, may dominate the user adaptation process and prevent meta-learning algorithms from learning meaningful meta-knowledge for personalized recommendations. To alleviate this limitation, we propose a novel sequential recommendation framework based on gradient-based meta-learning that captures the imbalanced rating distribution of each user and computes an adaptive loss for user-specific learning. Our work is the first to tackle the impact of imbalanced ratings in cold-start sequential recommendation scenarios. Through extensive experiments conducted on real-world datasets, we demonstrate the effectiveness of our framework.
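One simple way to keep a dominant rating value from swamping a user's loss is inverse-frequency weighting, sketched below. This is an assumed stand-in, not the adaptive loss of the paper (which the abstract does not specify); the function names and the toy ratings are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(ratings):
    """Weight each rating by 1 / (count of its value), rescaled so the mean weight is 1."""
    counts = Counter(ratings)
    raw = [1.0 / counts[r] for r in ratings]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

def weighted_squared_error(preds, ratings, weights):
    """Weighted mean of squared errors for one user's support set."""
    total = sum(w * (p - r) ** 2 for p, r, w in zip(preds, ratings, weights))
    return total / sum(weights)

# Hypothetical user who rated four items 5 and one item 1.
ratings = [5, 5, 5, 5, 1]
weights = inverse_frequency_weights(ratings)
# A constant prediction of 5 fits the majority but misses the single low rating.
loss = weighted_squared_error([5, 5, 5, 5, 5], ratings, weights)
print(weights, loss)
```

With these weights the single low rating contributes as much to the loss as the four high ratings combined, so a predictor that always outputs the user's majority rating is penalized more heavily than under an unweighted loss.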
Deep Causal Reasoning for Recommendations
Zhu, Yaochen, Yi, Jing, Xie, Jiayi, Chen, Zhenzhong
Traditional recommender systems aim to estimate a user's rating of an item based on observed ratings from the population. As with all observational studies, hidden confounders, which are factors that affect both item exposures and user ratings, lead to a systematic bias in the estimation. Consequently, a new trend in recommender system research is to negate the influence of confounders from a causal perspective. Observing that confounders in recommendations are usually shared among items and are therefore multi-cause confounders, we model the recommendation as a multi-cause multi-outcome (MCMO) inference problem. Specifically, to remedy confounding bias, we estimate user-specific latent variables that render the item exposures independent Bernoulli trials. The generative distribution is parameterized by a DNN with factorized logistic likelihood and the intractable posteriors are estimated by variational inference. Controlling these factors as substitute confounders, under mild assumptions, can eliminate the bias incurred by multi-cause confounders. Furthermore, we show that MCMO modeling may lead to high variance due to scarce observations associated with the high-dimensional causal space. Fortunately, we theoretically demonstrate that introducing user features as pre-treatment variables can substantially improve sample efficiency and alleviate overfitting. Empirical studies on simulated and real-world datasets show that the proposed deep causal recommender is more robust to unobserved confounders than state-of-the-art causal recommenders. Code and datasets are released at https://github.com/yaochenzhu/deep-deconf.
Eliminating Bias in Recommender Systems via Pseudo-Labeling
Addressing the non-uniform missing mechanism of rating feedback is critical to building a well-performing recommender in real-world systems. To tackle this challenging issue, we first define an ideal loss function that should be optimized to achieve the goal of recommendation. Then, we derive the generalization error bound of the ideal loss that alleviates the variance and the misspecification problems of the previous propensity-based methods. We further propose a meta-learning method minimizing the bound. Empirical evaluation using real-world datasets validates the theoretical findings and demonstrates the practical advantages of the proposed upper bound minimization approach.
Securing Behavior-based Opinion Spam Detection
Ge, Shuaijun, Ma, Guixiang, Xie, Sihong, Yu, Philip S.
Review spam is prevalent in e-commerce, where it is used to maliciously manipulate product rankings and customers' decisions. While spam generated with simple spamming strategies can be detected effectively, hardened spammers can evade regular detectors via more advanced spamming strategies. Previous work gave more attention to evasion against text and graph-based detectors, but evasions against behavior-based detectors are largely ignored, leading to vulnerabilities in spam detection systems. Since real evasion data are scarce, we first propose EMERAL (Evasion via Maximum Entropy and Rating sAmpLing) to generate evasive spam against certain existing detectors. EMERAL can simulate spammers with different goals and levels of knowledge about the detectors, targeting different stages of the life cycle of target products. We show that in the evasion-defense dynamic, only a few evasion types are meaningful to the spammers, and no spammer can evade too many detection signals at the same time. We reveal that some evasions are quite insidious and can defeat all detection signals. We then propose DETER (Defense via Evasion generaTion using EmeRal), based on model re-training on diverse evasive samples generated by EMERAL. Experiments confirm that DETER is more accurate in detecting both suspicious time windows and individual spam reviews. In terms of security, DETER is versatile enough to be vaccinated against diverse and unexpected evasions, is agnostic about evasion strategy, and can be released without privacy concerns.
Adversarial Recommendation: Attack of the Learned Fake Users
Christakopoulou, Konstantina, Banerjee, Arindam
Can machine learning models for recommendation be easily fooled? While the question has been answered for hand-engineered fake user profiles, it has not been explored for machine learned adversarial attacks. This paper attempts to close this gap. We propose a framework for generating fake user profiles which, when incorporated in the training of a recommendation system, can achieve an adversarial intent, while remaining indistinguishable from real user profiles. We formulate this procedure as a repeated general-sum game between two players: an oblivious recommendation system $R$ and an adversarial fake user generator $A$ with two goals: (G1) the rating distribution of the fake users needs to be close to the real users, and (G2) some objective $f_A$ encoding the attack intent, such as targeting the top-K recommendation quality of $R$ for a subset of users, needs to be optimized. We propose a learning framework to achieve both goals, and offer extensive experiments considering multiple types of attacks highlighting the vulnerability of recommendation systems.
BIRDNEST: Bayesian Inference for Ratings-Fraud Detection
Hooi, Bryan, Shah, Neil, Beutel, Alex, Gunnemann, Stephan, Akoglu, Leman, Kumar, Mohit, Makhija, Disha, Faloutsos, Christos
Review fraud is a pervasive problem in online commerce, in which fraudulent sellers write or purchase fake reviews to manipulate perception of their products and services. Fake reviews are often detected based on several signs, including 1) they occur in short bursts of time; 2) fraudulent user accounts have skewed rating distributions. However, these may not both be true in any given dataset. Hence, in this paper, we propose an approach for detecting fraudulent reviews which combines these 2 approaches in a principled manner, allowing successful detection even when one of these signs is not present. To combine these 2 approaches, we formulate our Bayesian Inference for Rating Data (BIRD) model, a flexible Bayesian model of user rating behavior. Based on our model we formulate a likelihood-based suspiciousness metric, Normalized Expected Surprise Total (NEST). We propose a linear-time algorithm for performing Bayesian inference using our model and computing the metric. Experiments on real data show that BIRDNEST successfully spots review fraud in large, real-world graphs: the 50 most suspicious users of the Flipkart platform flagged by our algorithm were investigated and all identified as fraudulent by domain experts at Flipkart.
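The second sign above, a skewed rating distribution, can be made concrete with a minimal sketch that scores a user by the KL divergence between their smoothed rating histogram and the global one. This is not the BIRD model or the NEST metric; the five-level scale, the add-one smoothing, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def rating_distribution(ratings, levels=(1, 2, 3, 4, 5), smoothing=1.0):
    """Smoothed histogram of ratings over the given levels; entries sum to 1."""
    counts = Counter(ratings)
    total = len(ratings) + smoothing * len(levels)
    return [(counts[level] + smoothing) / total for level in levels]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same levels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical data: ratings pooled over all users, and two individual users.
global_dist = rating_distribution([1, 2, 3, 3, 4, 4, 4, 5, 5, 5])
skewed_user = rating_distribution([5, 5, 5, 5, 5, 5])
typical_user = rating_distribution([3, 4, 4, 5, 5])

skewed_score = kl_divergence(skewed_user, global_dist)
typical_score = kl_divergence(typical_user, global_dist)
print(skewed_score, typical_score)
```

The user who gives only 5-star ratings receives a larger divergence than the user whose ratings resemble the pooled histogram. Smoothing keeps every probability positive, so the logarithm is always defined.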
Pre-release Prediction of Crowd Opinion on Movies by Label Distribution Learning
Geng, Xin (Southeast University) | Hou, Peng (Southeast University)
This paper studies an interesting problem: is it possible to predict the crowd opinion about a movie before the movie is actually released? The crowd opinion is here expressed by the distribution of ratings given by a sufficient number of people. Consequently, the pre-release crowd opinion prediction can be regarded as a Label Distribution Learning (LDL) problem. In order to solve this problem, a Label Distribution Support Vector Regressor (LDSVR) is proposed in this paper. The basic idea of LDSVR is to fit a sigmoid function to each component of the label distribution simultaneously by a multi-output support vector machine. Experimental results show that LDSVR can accurately predict people's rating distribution about a movie just based on the pre-release metadata of the movie.
Transfer Learning in Collaborative Filtering with Uncertain Ratings
Pan, Weike (Hong Kong University of Science and Technology) | Xiang, Evan W. (Hong Kong University of Science and Technology) | Yang, Qiang (Hong Kong University of Science and Technology)
To solve the sparsity problem in collaborative filtering, researchers have introduced transfer learning as a viable approach to make use of auxiliary data. Most previous transfer learning works in collaborative filtering have focused on exploiting point-wise ratings such as numerical ratings, stars, or binary ratings of likes/dislikes. However, in many real-world recommender systems, many users may be unwilling or unlikely to rate items with precision. In contrast, practitioners can turn to various non-preference data to estimate a range or rating distribution of a user's preference on an item. Such a range or rating distribution is called an uncertain rating since it represents a rating spectrum of uncertainty instead of an accurate point-wise score. In this paper, we propose an efficient transfer learning solution for collaborative filtering, known as transfer by integrative factorization (TIF), to leverage such auxiliary uncertain ratings to improve the performance of recommendation. In particular, we integrate auxiliary data of uncertain ratings as additional constraints in the target matrix factorization problem, and learn an expected rating value for each uncertain rating automatically. The advantages of our proposed approach include the efficiency and the improved effectiveness of collaborative filtering, showing that incorporating the auxiliary data of uncertain ratings can bring a real benefit. Experimental results on two movie recommendation tasks show that our TIF algorithm performs significantly better than a state-of-the-art non-transfer learning method.
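To make the notion of an uncertain rating concrete, the sketch below represents one as a discrete distribution over rating values and reduces it to a fixed expected value. TIF learns the expected value jointly with the factorization, which this sketch does not attempt; the function names and the uniform-range assumption are hypothetical.

```python
def expected_rating(distribution):
    """Expected value of a discrete rating distribution given as {rating: probability}."""
    total = sum(distribution.values())
    if total <= 0:
        raise ValueError("distribution must have positive total mass")
    return sum(r * p for r, p in distribution.items()) / total

def uniform_range(low, high):
    """Uniform distribution over the integer ratings low..high inclusive (an assumed prior)."""
    n = high - low + 1
    return {r: 1.0 / n for r in range(low, high + 1)}

# A rating known only to lie between 3 and 5, and one with an explicit distribution.
range_rating = expected_rating(uniform_range(3, 5))
dist_rating = expected_rating({4: 0.25, 5: 0.75})
print(range_rating, dist_rating)
```

Under the uniform assumption a range collapses to its midpoint, so two uncertain ratings with the same range but different underlying preferences become indistinguishable. That loss of information is one reason to learn the expected value rather than fix it.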